TCtract-A Collocation Extraction Approach for Noun Phrases Using Shallow Parsing Rules and Statistic Models
Abstract
This paper presents a hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge. The algorithm first extracts all noun phrase collocations from a shallow-parsed corpus by using syntactic knowledge in the form of phrase rules. It then removes pseudo-collocations by using a set of statistics-based association measures (AMs) as filters. The hybrid algorithm is designed for two main purposes: (1) to maintain reasonable recall while improving precision, and (2) to investigate the proposed association measures on Chinese noun phrase collocations. Its performance is compared with a pure statistical model and a pure rule-based method on a 60 MB PoS-tagged corpus. The experimental results, based on 29 randomly selected noun headwords, show that the proposed hybrid method achieves a precision of 92.65% and a recall of 47%, compared with a precision of 78.87% and a recall of 27.19% for a statistics-based extraction system. The F-score improvement is 55.7%.
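To make the reported precision and recall figures easier to compare, the F-score can be recomputed from them. The following is a minimal sketch, assuming the balanced F1 measure (the abstract does not state which weighting was used) and assuming the improvement is expressed relative to the statistical baseline. The names `f_score` and `relative_gain` are hypothetical and not taken from the paper. Because the published precision and recall are rounded, the recomputed gain need not match the reported 55.7% exactly.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def relative_gain(new, old):
    """Improvement of `new` over `old`, expressed as a fraction of `old`."""
    return (new - old) / old


# Precision and recall as reported in the abstract (rounded values).
hybrid = f_score(0.9265, 0.47)
statistical = f_score(0.7887, 0.2719)

print(f"hybrid F1      = {hybrid:.4f}")
print(f"statistical F1 = {statistical:.4f}")
print(f"relative gain  = {relative_gain(hybrid, statistical):.1%}")
```

Passing a different `beta` would weight recall more or less heavily than precision, which is one possible source of a small difference from the published figure.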
Similar resources
Noun Phrase Chunking for Marathi using Distant Supervision
Information Extraction from Indian languages requires effective shallow parsing, especially identification of “meaningful” noun phrases. Particularly, for an agglutinative and free word order language like Marathi, this problem is quite challenging. We model this task of extracting noun phrases as a sequence labelling problem. A Distant Supervision framework is used to automatically create a la...
Significant Phrases Detection
The problem of determining keywords and phrases which best characterize a text document has important applications such as building a compact index for a large-scale text processing system, or using a keyword set for summarization and topic detection. We approached this problem from two perspectives. Our knowledge-poor approach is based on statistical collocation detection using the t-test and li...
An Algorithm Combining Statistics-based and Rules-based for Chunk Identification of Chinese Sentences
Natural language processing (NLP) is a very active research domain. One important branch of it is sentence analysis, including Chinese sentence analysis. However, no mature deep analysis theories and techniques are currently available. An alternative is to perform shallow parsing on sentences, which is very popular in the domain. Chunk identification is a fundamental task for shallow parsi...
A Learning Approach to Shallow Parsing
A SNoW based learning approach to shallow parsing tasks is presented and studied experimentally. The approach learns to identify syntactic patterns by combining simple predictors to produce a coherent inference. Two instantiations of this approach are studied and experimental results for Noun-Phrases (NP) and Subject-Verb (SV) phrases that compare favorably with the best published results are p...
Shallow parsing of Hungarian business news
The present paper reports on an attempt to annotate noun phrases in Hungarian using cascaded regular grammars. Hungarian presents several difficulties for shallow parsing, such as discourse-oriented constituent order as well as left-branching recursive possessive and participle structures inside noun phrases. The approach uses cascaded regular grammars and was developed with the CLaRK system. The ...